PDTSSE: A Scalable Parallel Decision Tree Algorithm Based on MapReduce

نویسندگان

Yan Cui

Yuanyang Yang

Shizhong Liao

چکیده

Parallel decision tree learning is an effective and efficient approach to scaling the decision tree to large data mining application. Aiming at large scale decision tree learning, we present a novel parallel decision tree learning algorithm in MapReduce framework, called PDTSSE (Parallel Decision Tree via Sampling Splitting points with Estimation). We first propose an estimation method for sampling splitting points, which can effectively handle both categorical and numeric attributes over large scale data. We also derive an error bound for the algorithm, and analyze the computational complexities of the algorithm. Finally, we describe the implementation procedures in MapReduce framework. Theoretical analysis and experimental results show that PDTSSE has low computational cost compared to state of the art classifiers while maintaining the quality of the generated trees in terms of accuracy, and can scale to large scale data mining application. Keywords-Classification; Decision Tree; Sampling; Parallel, MapReduce

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce

Classification and regression tree learning on massive datasets is a common data mining task at Google, yet many state of the art tree learning algorithms require training data to reside in memory on a single machine. While more scalable implementations of tree learning have been proposed, they typically require specialized parallel computing architectures. In contrast, the majority of Google’s...

متن کامل

MR-Tree - A Scalable MapReduce Algorithm for Building Decision Trees

Learning decision trees against very large amounts of data is not practical on single node computers due to the huge amount of calculations required by this process. Apache Hadoop is a large scale distributed computing platform that runs on commodity hardware clusters and can be used successfully for data mining task against very large datasets. This work presents a parallel decision tree learn...

متن کامل

A Scalable Expressive Ensemble Learning Using Random Prism: A MapReduce Approach

The induction of classification rules from previously unseen examples is one of the most important data mining tasks in science as well as commercial applications. In order to reduce the influence of noise in the data, ensemble learners are often applied. However, most ensemble learners are based on decision tree classifiers which are affected by noise. The Random Prism classifier has recently ...

متن کامل

Large - Scale Non - Linear Regression within the Mapreduce Framework

Large-scale Non-linear Regression within the MapReduce Framework By: Ahmed Khademzadeh Thesis Advisor: Philip Chan, Ph.D. Regression models have many applications in real world problems such as finance, epidemiology, environmental science, etc.. Big datasets are everywhere these days, and bigger datasets would help us to construct better models from the data. The issue with big datasets is that...

متن کامل

Data Deduplication in Parallel Mining of Frequent Item sets using MapReduce

A Parallel Frequent Item sets mining algorithm called FiDoop using MapReduce programming model. FiDoop includes the frequent items ultrametric tree(FIU-tree), in that three MapReduce jobs are applied to complete the mining task. The scalability problem has been addressed bythe implementation of a handful of FP-growth-like parallelFIM algorithms. InFiDoop, the mappers independently and concurren...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2013

PDTSSE: A Scalable Parallel Decision Tree Algorithm Based on MapReduce

نویسندگان

چکیده

منابع مشابه

PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce

MR-Tree - A Scalable MapReduce Algorithm for Building Decision Trees

A Scalable Expressive Ensemble Learning Using Random Prism: A MapReduce Approach

Large - Scale Non - Linear Regression within the Mapreduce Framework

Data Deduplication in Parallel Mining of Frequent Item sets using MapReduce

عنوان ژورنال:

اشتراک گذاری